Ensemble learning for integrative prediction of genetic values with genomic variants

Background Whole genome variants offer sufficient information for genetic prediction of human disease risk and for prediction of animal and plant breeding values. Many sophisticated statistical methods have been developed to enhance predictive ability. However, each method has its own advantages and disadvantages, and so far no single method consistently outperforms the others. Results We herein propose an Ensemble Learning method for Prediction of Genetic Values (ELPGV), which assembles predictions from several basic methods, such as GBLUP, BayesA, BayesB and BayesCπ, to produce more accurate predictions. We validated ELPGV with a variety of well-known datasets and a series of simulated datasets. All results showed that ELPGV significantly enhanced predictive ability relative to every basic method; for instance, the p-values for the comparison of ELPGV with the basic methods ranged from 4.853E−118 to 9.640E−20 for the WTCCC dataset. Conclusions ELPGV integrates the merits of each method to produce significantly higher predictive ability than any basic method. It is simple to implement, fast to run and does not require genotype data, and it is promising for wide application in genetic prediction.

and disadvantages. BayesA is mainly applicable to traits controlled by many genes with tiny effects, whereas BayesB and BayesCπ are suitable for traits controlled by a small number of major-effect genes. In human disease risk prediction, the "Clumping + Thresholding" (C + T) method has been developed and applied [9][10][11][12]. The C + T method first identifies a set of markers with predictive power and then uses these markers to predict disease risk by logistic regression [1,13,14], which suits diseases controlled by several major-effect genes. Although several methods exist, each has its own limitations, and so far no single method always outperforms the others.
Ensemble learning is a machine learning approach that integrates the predictions from multiple methods to obtain a new prediction through supervised or unsupervised learning [15]. As early as 20 years ago, it was found that ensemble learning can reduce generalization error [16], and ensemble methods that combine the output of multiple methods have been shown to achieve better generalizability than a single method [17]. Ensemble learning has since made a substantial impact on the field of bioinformatics through its widespread applications [18]. One example is the prediction of the localization of long non-coding RNAs, where multiple sub-networks were used to integrate distinct feature sets to maximize performance [19]. In another work, a CNN/RNN (Convolutional Neural Network/Recurrent Neural Network) ensemble was used to integrate features and raw sequence data to predict different types of translation initiation sites [20], overcoming the generalizability issue of traditional methods, which can only predict a specific type of translation initiation site. Moreover, the stability and reproducibility offered by ensemble methods, for example in feature selection, are also making a substantial impact in biomarker discovery [21,22]. To the best of our knowledge, the remarkable flexibility and adaptability of ensemble learning have led to the proliferation of its applications in bioinformatics research [23].
We herein propose an ensemble learning method for Prediction of Genetic Values (ELPGV). ELPGV trains several different basic methods, such as GBLUP, BayesA, BayesB and BayesCπ, and combines them to produce more accurate predictions. The core of ELPGV uses a hybrid of differential evolution [24] and particle swarm optimization [25] to train the weights by which the predictions of the basic methods are averaged to generate a new prediction. A variety of datasets, including WTCCC (Wellcome Trust Case Control Consortium), IBDGC (International Inflammatory Bowel Disease Genetics Consortium), cattle, wheat and computational simulations, are employed to validate ELPGV.

Basic methods
The prediction is based on a linear model according to Eq. (1):

$$ y = X\alpha + Z\beta + e \tag{1} $$

where y is the vector of phenotypes; X is the design matrix for fixed effects; α is the vector of fixed effects; Z is the matrix of variant genotypes, coded as "0", "1" and "2" for genotypes "AA", "Aa" and "aa", respectively, or as genotype dosages of SNPs; β is the vector of SNP effects; and e is the vector of residual errors, assumed to follow a normal distribution, $e \sim N(0, I\sigma_e^2)$, where I is an identity matrix and $\sigma_e^2$ is the residual variance.

In this study, four basic methods are used for genetic value prediction: BayesA, BayesB, BayesCπ and GBLUP. In BayesA, all SNPs are assumed to contribute to genetic variation, and the variance of each SNP effect is assumed to follow an inverse chi-square distribution; BayesB and BayesCπ assume that a small fraction (π) of SNPs have non-zero effects [6,8], where π is set to 0.1 in BayesB [26]. The Bayesian methods are implemented with the function "BGLR" in the R package "BGLR" [27]. In GBLUP, the variances of all SNP effects are assumed to be equal, and the genetic values are estimated with the mixed model equations through a kinship matrix constructed from SNPs [5]. GBLUP is implemented using the function "emmreml" in the R package "EMMREML" [28].

ELPGV model construction
The ELPGV framework comprises two components: weight training and weighted prediction. First, it trains the basic methods to obtain predictions; then, it trains the weights of the basic methods with machine learning; finally, it generates new predictions as the weighted average of the predictions of the basic methods. The schematic diagram of the study methodology is given in Fig. 1.
Suppose n basic methods are investigated; the prediction of ELPGV can then be expressed as Eq. (2):

$$ g_{predicted} = \sum_{j=1}^{n} W_j p_j \tag{2} $$

where $p_j$ is the vector of values predicted by the jth basic method, which is easily obtained from each basic method, and $W_j$ is the weight of the jth basic method.
To train the weight W, a fitness function is defined as the correlation coefficient between the predicted values $g_{predicted}$ and the observed values $y_{observed}$ (Eq. 3):

$$ fitness = cor(g_{predicted}, y_{observed}) \tag{3} $$

where $g_{predicted}$ is the vector of ELPGV predictions from Eq. (2).
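As an illustration, Eqs. (2) and (3) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes Eq. (2) is a plain weighted sum of the basic-method predictions, and the function names (`pearson`, `ensemble_prediction`, `fitness`) and the toy numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ensemble_prediction(weights, predictions):
    """Eq. (2), assumed form: weighted sum of the basic-method predictions.

    predictions[j][k] is the value predicted by basic method j for individual k.
    """
    n_individuals = len(predictions[0])
    return [sum(w * p[k] for w, p in zip(weights, predictions))
            for k in range(n_individuals)]


def fitness(weights, predictions, target):
    """Eq. (3): correlation between the ensemble prediction and the target."""
    return pearson(ensemble_prediction(weights, predictions), target)


# Toy numbers (made up): two basic methods, four individuals.
preds = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
observed = [1.1, 1.9, 3.2, 3.8]
print(fitness([1.0, 0.0], preds, observed))  # method 1 alone
print(fitness([0.5, 0.5], preds, observed))  # equal weights
```

Because the correlation is unchanged when all weights are multiplied by the same positive constant, only the relative sizes of the weights matter for this fitness function.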
For the testing population, the phenotype $y_{observed}$ is unknown; we therefore introduce reference genetic values to replace the unknown phenotypic values in Eq. (3). ELPGV uses a mixture of the differential evolution (DE) algorithm and the particle swarm optimization (PSO) algorithm to estimate the weight W, which includes initialization, mutation, crossover and selection steps.
Step 1. Initialization: ELPGV randomly initializes the weights $W_{i\cdot} = (W_{i,1}, \ldots, W_{i,n})$ and the optimization velocities $V_{i\cdot} = (V_{i,1}, \ldots, V_{i,n})$, for i = 1, ..., m and j = 1, ..., n, where m is the number of particles (candidate groups of weights) and n is the number of basic methods. The weights are initialized with Eq. (4) and the optimization velocities with Eq. (5).
First, the m groups of weights are substituted into Eq. (2) to obtain the corresponding m groups of ELPGV predictions; the predictions are then substituted into Eq. (3) to assess the fitness of each group. We then define the optimal weight $W^{(0)}$ as the group of weights with the best fitness among all groups.
Step 2. Mutation: In the tth iteration, Eq. (4) is replaced with Eqs. (7) and (8) to update the weights of each group.
where F is the scaling factor, which controls the effect of the difference vector, and the indices are mutually distinct, i ≠ k ≠ p ≠ q.
Step 3. Crossover: The crossover operation randomly switches the weights between the current iteration (t) and the last iteration (t − 1) according to Eq. (9), where CR is the crossover probability and $rand_{i\cdot}(0, 1)$ is a random value between 0 and 1 for the ith group of weights.
Step 4. Selection: Finally, all the groups of weights are updated with Eqs. (10) and (11).
After the tth iteration, the velocity of each group of weights is updated according to Eq. (12), where ε is the inertia weight and $c_1$ and $c_2$ are acceleration factors. At the same time, ELPGV updates the fitness with the new weights at the tth update using Eq. (3); the optimal weight in the tth iteration can be expressed as Eq. (13).
Once the fitness meets a certain criterion or the number of iterations reaches the maximum, ELPGV returns the optimal weights W and the predictions from Eq. (2). To reduce sampling error and increase the accuracy of the weight estimates, the whole estimation is repeated 100 times and the averaged weights are used by ELPGV (Table 1).
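The four steps and the velocity update can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the forms of Eqs. (4) to (13) are not reproduced in this text, so the code substitutes standard choices (uniform initialization, DE/rand/1 mutation, binomial crossover, greedy selection, and a PSO move that is kept only when it improves fitness). The function names are hypothetical, the default parameter values follow the parameter table of this paper, and the toy data are made up.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def fitness(w, preds, reference):
    """Eq. (3) applied to the weighted sum of Eq. (2) (assumed form)."""
    combined = [sum(wj * p[k] for wj, p in zip(w, preds))
                for k in range(len(reference))]
    return pearson(combined, reference)


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def de_pso_weights(preds, reference, m=20, F=0.5, CR=0.3, eps=1.0, c1=2.0,
                   c2=2.0, w_rng=(0.0, 1.0), v_rng=(-0.01, 0.01),
                   max_iter=25, seed=0):
    rng = random.Random(seed)
    n = len(preds)
    (w_lo, w_hi), (v_lo, v_hi) = w_rng, v_rng
    # Step 1 (assumption): uniform initialization within the stated bounds.
    W = [[rng.uniform(w_lo, w_hi) for _ in range(n)] for _ in range(m)]
    V = [[rng.uniform(v_lo, v_hi) for _ in range(n)] for _ in range(m)]
    fit = [fitness(w, preds, reference) for w in W]
    p_best, p_fit = [w[:] for w in W], fit[:]
    g = max(range(m), key=lambda i: fit[i])
    g_best, g_fit = W[g][:], fit[g]
    for _ in range(max_iter):
        for i in range(m):
            # Step 2: DE/rand/1 mutation from three other, distinct groups.
            k, p, q = rng.sample([x for x in range(m) if x != i], 3)
            mutant = [clip(W[k][j] + F * (W[p][j] - W[q][j]), w_lo, w_hi)
                      for j in range(n)]
            # Step 3: binomial crossover with probability CR per weight.
            forced = rng.randrange(n)
            trial = [mutant[j] if (rng.random() < CR or j == forced) else W[i][j]
                     for j in range(n)]
            # Step 4: greedy selection keeps the fitter of trial and current.
            t_fit = fitness(trial, preds, reference)
            if t_fit > fit[i]:
                W[i], fit[i] = trial, t_fit
            # PSO-style velocity update; the move is kept only if it helps.
            V[i] = [clip(eps * V[i][j]
                         + c1 * rng.random() * (p_best[i][j] - W[i][j])
                         + c2 * rng.random() * (g_best[j] - W[i][j]), v_lo, v_hi)
                    for j in range(n)]
            moved = [clip(W[i][j] + V[i][j], w_lo, w_hi) for j in range(n)]
            m_fit = fitness(moved, preds, reference)
            if m_fit > fit[i]:
                W[i], fit[i] = moved, m_fit
            if fit[i] > p_fit[i]:
                p_best[i], p_fit[i] = W[i][:], fit[i]
            if fit[i] > g_fit:
                g_best, g_fit = W[i][:], fit[i]
    return g_best, g_fit


# Toy data (made up): four noisy "basic method" predictions of one signal.
data_rng = random.Random(42)
signal = [data_rng.gauss(0, 1) for _ in range(60)]
preds = [[s + data_rng.gauss(0, sd) for s in signal] for sd in (0.5, 0.8, 1.0, 1.5)]
reference = [s + data_rng.gauss(0, 0.3) for s in signal]

weights, best = de_pso_weights(preds, reference)
print([round(w, 3) for w in weights], round(best, 4))
```

To mimic the repeated estimation described above, the function could be called with different `seed` values and the returned weights averaged; the `reference` vector here is only a stand-in for the reference genetic values.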

Monte Carlo cross-validation
Cross-validation was employed to evaluate the prediction performance of the genomic selection (GS) methods. The individuals of each dataset were first randomly divided into two parts at a ratio of 9:1, which were taken as the training set and the testing set, respectively. The cross-validation was repeated 100 times. In each prediction, the phenotypes of the individuals in the testing set were masked and their genetic values were predicted from the training set; Pearson's correlation coefficient between the predicted values and the true phenotypes was then used to evaluate the predictive ability of each method.
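The procedure can be sketched as below. This is a minimal sketch under assumptions: `fit_predict` is a hypothetical callable standing in for any basic method, and the toy predictor (least-squares regression on a single made-up covariate) is only a placeholder for GBLUP or the Bayesian methods.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def monte_carlo_cv(phenotypes, fit_predict, n_repeats=100, test_fraction=0.1, seed=1):
    """Repeat random 9:1 splits and return one predictive ability per repeat.

    fit_predict(train_idx, test_idx) must fit on the training individuals only
    and return predicted values for the testing individuals.
    """
    rng = random.Random(seed)
    n = len(phenotypes)
    n_test = max(2, round(n * test_fraction))
    abilities = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        predicted = fit_predict(train_idx, test_idx)
        # The testing phenotypes are used only here, for evaluation.
        observed = [phenotypes[i] for i in test_idx]
        abilities.append(pearson(predicted, observed))
    return abilities


# Toy data (made up): one covariate with a linear effect on the phenotype.
data_rng = random.Random(7)
x = [data_rng.gauss(0, 1) for _ in range(200)]
y = [2.0 * xi + data_rng.gauss(0, 1) for xi in x]


def one_covariate_regression(train_idx, test_idx):
    """Placeholder predictor: simple least-squares fit on the training set."""
    mx = sum(x[i] for i in train_idx) / len(train_idx)
    my = sum(y[i] for i in train_idx) / len(train_idx)
    slope = (sum((x[i] - mx) * (y[i] - my) for i in train_idx)
             / sum((x[i] - mx) ** 2 for i in train_idx))
    return [my + slope * (x[i] - mx) for i in test_idx]


abilities = monte_carlo_cv(y, one_covariate_regression)
print(len(abilities), sum(abilities) / len(abilities))
```

Fixing the random seed means every method can be evaluated on exactly the same 100 splits, which is what makes a paired comparison between methods possible.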

Paired-sample t-test
Because all the methods were compared on the same replicated datasets, we were able to compare ELPGV with the basic methods using a paired-sample t-test, expressed as $t = \bar{d}/s_{\bar{d}}$ with n − 1 degrees of freedom, where n is the number of cross-validation replicates, $\bar{d}$ is the mean difference in predictive ability between ELPGV and the other method, and $s_{\bar{d}}$ is the standard error of that mean difference.
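A small sketch of this statistic is given below, assuming the usual definition of the standard error of the mean difference, $s_{\bar{d}} = s_d/\sqrt{n}$. The function name `paired_t` and the five pairs of predictive abilities are made up for illustration and are not results from this study.

```python
import math


def paired_t(a, b):
    """Paired-sample t statistic and degrees of freedom for a[i] - b[i]."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean_d = sum(d) / n
    # Sample standard deviation of the differences (n - 1 in the denominator).
    s_d = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))
    se_mean = s_d / math.sqrt(n)  # standard error of the mean difference
    return mean_d / se_mean, n - 1


# Made-up predictive abilities from five matched cross-validation replicates.
ensemble = [0.80, 0.82, 0.79, 0.85, 0.81]
basic = [0.78, 0.80, 0.78, 0.82, 0.80]
t_stat, dof = paired_t(ensemble, basic)
print(t_stat, dof)
```

The p-value then follows from the t distribution with n − 1 degrees of freedom; in Python this is available, for example, through `scipy.stats.ttest_rel`.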

Inflammatory bowel disease (IBD) dataset
The inflammatory bowel disease dataset was accessed from the International IBD Genetics Consortium (IBDGC) and includes 20,155 Crohn disease (CD) cases, 15,191 ulcerative colitis (UC) cases and 34,257 controls of European ancestry. Genotypes were called using optiCall for a total of 192,402 autosomal variants before quality control. A total of 161,681 SNPs remained after removing the SNPs with MAF < 0.02 and HWE test p-value < 10e−5. Missing genotypes were imputed with IMPUTE2 using the 1000 Genomes data as a reference (for details, see ref. [31]). To reduce the computational burden, we further pruned SNPs for linkage disequilibrium with a threshold of r² = 0.5 using PLINK [30] and randomly sampled 1,000 individuals from the Liege and Brussels batches.

Cattle dataset
A German Holstein genomic prediction population comprising 5024 bulls [32] was further employed to validate ELPGV; all bulls were genotyped with the Illumina Bovine SNP50 BeadChip [33]. After removing the SNPs with HWE p-value < $10^{-4}$, call rate < 0.95 and MAF < 0.01, a total of 42,551 SNPs remained for the downstream analysis.
The estimated breeding values of three traits, milk fat percentage (mfp), milk yield (my) and somatic cell score (scs), were available and used in this study.

Wheat dataset
The wheat dataset was collected from CIMMYT's Global Wheat Program; the grain yields (GY) of 599 wheat inbred lines were recorded at four places [34,35]. Each wheat line was genotyped with 1447 Diversity Array Technology (DArT) markers by Triticarte Pty. Ltd. Each marker had two genotypes, coded "1" or "0" to indicate its presence or absence, respectively. After filtering, 1279 markers were kept for analysis.

Simulations
We took advantage of the genotypes of the wheat dataset for simulation. A number of QTL were simulated with effects sampled from a gamma distribution with scale parameter 1.66 and shape parameter 0.4; the residual errors were sampled from a normal distribution with variance set according to the heritability. We performed two simulation experiments to investigate the performance of ELPGV: (1) different numbers of QTL, 5 and 1,000; and (2) different heritabilities, 0.5 and 0.2. This led to four scenarios (QTL5 and h² = 0.2, QTL5 and h² = 0.5, QTL1000 and h² = 0.2, QTL1000 and h² = 0.5).
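One way to generate such phenotypes is sketched below. It is a sketch under assumptions, not the authors' simulation code: QTL are drawn at random among the markers, the residual variance is set to $\sigma_g^2 (1 - h^2)/h^2$ so that the expected heritability equals h², all effects are left positive because no sign rule is stated, and a small random 0/1 matrix stands in for the wheat genotypes. The function name `simulate_phenotypes` is hypothetical.

```python
import math
import random
import statistics


def simulate_phenotypes(genotypes, n_qtl, h2, shape=0.4, scale=1.66, seed=0):
    """Simulate phenotypes from a genotype matrix (rows are individuals)."""
    rng = random.Random(seed)
    n_markers = len(genotypes[0])
    # Choose QTL positions at random (assumption) and draw gamma effects.
    qtl = rng.sample(range(n_markers), n_qtl)
    effects = {j: rng.gammavariate(shape, scale) for j in qtl}
    genetic = [sum(row[j] * b for j, b in effects.items()) for row in genotypes]
    # Residual variance chosen so that var_g / (var_g + var_e) equals h2.
    var_g = statistics.pvariance(genetic)
    var_e = var_g * (1.0 - h2) / h2
    sd_e = math.sqrt(var_e)
    phenotypes = [g + rng.gauss(0.0, sd_e) for g in genetic]
    return phenotypes, genetic, qtl


# Toy genotypes (made up): 300 lines and 100 presence/absence markers.
geno_rng = random.Random(3)
toy_genotypes = [[geno_rng.randint(0, 1) for _ in range(100)] for _ in range(300)]

y, g, qtl = simulate_phenotypes(toy_genotypes, n_qtl=5, h2=0.5)
realized_h2 = statistics.pvariance(g) / statistics.pvariance(y)
print(sorted(qtl), realized_h2)
```

The realized ratio of genetic to phenotypic variance fluctuates around the target h² from one simulated dataset to the next because the residuals are sampled.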

Results
We used both real and simulated datasets to validate the performance of ELPGV. In this study, four popular GS methods, GBLUP, BayesA, BayesB and BayesCπ, were assembled with ELPGV, although ELPGV can assemble any number of methods. In addition, cross-validation was employed to evaluate the prediction performance of each method. Taking advantage of the fact that all the methods were compared on the same datasets, we used the paired-sample t-test to assess significance.
We then used ELPGV to assemble the predictions of the four basic methods to generate new predictions. To this end, we first evaluated the fitting effect of the four basic methods on the training set; the basic method with the best fitting effect was used to generate the reference genetic values. The fitting effect was defined as the correlation between the estimated genetic values and the phenotypes in the training set. It was found that BayesCπ usually had a better fitting effect than the other methods. With the reference genetic values, ELPGV assembled the four basic methods to obtain new predictions; the average predictive ability of ELPGV across 100 validations was r = 0.8471, significantly higher than that of any basic method, with comparison p-values ranging from 1.090E−112 (GBLUP) to 6.458E−31 (BayesCπ) (Table 3). Because we compared all methods on the same datasets, we were able to compare ELPGV with the four basic methods in each of the 100 experiments separately. Figure 2a-f shows the predictive abilities in each experiment: ELPGV is more accurate than the four basic methods, and its advantage over GBLUP is the most pronounced. We also compared ELPGV with the four basic methods on the BD, CAD, T1D, RA and HT datasets (Table 3). For all diseases, ELPGV was clearly more accurate than the four basic methods, with p-values ranging from 4.853E−118 to 9.640E−20 (Table 3).

IBD dataset
We also applied ELPGV to predict disease risk for the IBD dataset of European ancestry.
The averaged predictive abilities of GBLUP, BayesA, BayesB and BayesCπ for UC across 100 cross-validations were 0.6687, 0.7817, 0.7831 and 0.7845, respectively. After assembling with ELPGV, the averaged predictive ability was 0.7920, significantly higher than those of the four basic methods, with p-values ranging from 3.314E−56 to 3.878E−13 (Table 4). Similarly, the predictive abilities of the four basic methods for CD ranged from 0.3692 (GBLUP) to 0.4452 (BayesCπ); after assembling with ELPGV, the predictive ability increased to 0.4516, significantly higher than those of the four basic methods (p-values from 3.659E−34 to 3.938E−07, Table 4). We also show the comparison for each experiment individually: in the vast majority of individual experiments, ELPGV outperformed the four basic methods, among which GBLUP had the lowest predictive ability (Fig. 3a-c).

Cattle dataset
We further validated ELPGV with the German Holstein cattle dataset, in which milk fat percentage (mfp), milk yield (my) and somatic cell score (scs) were investigated. For genetic prediction of mfp, BayesCπ had the highest predictive ability among the four basic methods (r = 0.8632), whereas GBLUP had the lowest (r = 0.8259) (Table 5). After the four basic methods were assembled with ELPGV, the predictive ability was 0.8748, significantly higher than that of any basic method (comparison p-values ranged from 9.943E−80 to 2.356E−10, Table 5). The individual experiments showed that, for the vast majority of predictions, ELPGV was clearly more accurate than the four basic methods, especially GBLUP (Fig. 4b). For my, ELPGV also outperformed the four basic methods (Fig. 4a), with comparison p-values ranging from 5.133E−52 (GBLUP) to 1.335E−07 (BayesB). For scs, the advantage of ELPGV over the four basic methods was also significant, with p-values ranging from 3.801E−29 (GBLUP) to 0.001 (BayesCπ) (Table 5).
Figure 4 shows the accuracies of ELPGV and the four basic methods in the 100 individual experiments; for a large proportion of predictions, ELPGV has higher predictive ability than the four basic methods for all the investigated traits.

Wheat dataset
Wheat yields measured at four places were investigated for 599 individuals genotyped with 1279 DArT markers. The averaged predictive ability across 100 cross-validations for each place is shown in Table 6. For the first place, the predictive abilities of GBLUP, BayesA, BayesB and BayesCπ were 0.5251, 0.5231, 0.5080 and 0.5215, respectively, and the predictive ability of ELPGV was 0.5273, significantly higher than those of the four basic methods (comparison p-values ranged from 7.965E−19 to 2.297E−04; Table 6). For the other three places, the results also showed that the prediction accuracy of ELPGV was consistently higher than that of the four basic methods (Table 6). All the predictions of the 100 cross-validations are shown in Fig. 5a-d; ELPGV outperforms the four basic methods in the majority of single experiments at all four places.

Simulations
We finally performed simulation studies to further investigate the performance of ELPGV. For each scenario, 100 simulated datasets were generated. Each dataset was randomly divided into 5 equal parts, 4 of which were taken as the training set and the remaining part as the testing set. We first ran the four basic methods, GBLUP, BayesA, BayesB and BayesCπ, and then assembled their predictions with ELPGV to produce new predictions. For all simulations, ELPGV achieved significantly higher predictive abilities than the corresponding four basic methods, with comparison p-values ranging from 3.553E−34 to 0.001 (Table 7). The 100 replicated experiments also showed that, in each of the experiments, the prediction of ELPGV was more accurate than those of the basic methods (Fig. 6a-d), and the gain of ELPGV over GBLUP was more obvious when the QTL number was 5 than when it was 1,000.
Table 6 The predictive ability of four basic methods and ELPGV, and the comparison p-value between ELPGV and others in GY with wheat dataset. ELPGV is the ensemble learning based on BayesA, BayesB, BayesCπ and GBLUP. "-" represents that no explicit result was found in this method.

We next investigated the effect of the sample size of the training set. We randomly sampled 100, 200, 300, 400, 500 and 599 individuals from the wheat data; the QTL number was set to 5 and the heritability to 0.5. For each sample size, 100 independent datasets were generated. Cross-validation was used to evaluate the predictive abilities. The predictive abilities of ELPGV were higher than those of the four basic methods for all simulated sample sizes (Table 8). We next investigated whether the advantage of ELPGV over the other methods depended on the sample size. To do this, we summarized the maximum and minimum differences in predictive ability between ELPGV and the other methods (Table 8) and correlated the maximum (Fig. 7a) or minimum (Fig. 7b) differences with the corresponding sample sizes. We did not find evidence of a significant correlation (r = −0.58 and −0.076, with p-values of 0.23 and 0.89, respectively), which suggests that the gain of ELPGV over the basic methods is not affected by sample size.
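As a rough check of these correlation tests, one can assume that each correlation was computed across the six sample sizes (n = 6) and tested with the usual t statistic for a Pearson correlation; both assumptions are ours and are not stated in the text.

```latex
t = r\sqrt{\frac{n-2}{1-r^{2}}}, \qquad \mathrm{df} = n - 2 = 4

r = -0.58:\quad  t = -0.58\sqrt{\frac{4}{1-0.3364}} \approx -1.42

r = -0.076:\quad t = -0.076\sqrt{\frac{4}{1-0.0058}} \approx -0.15
```

Both values are well inside the two-sided 5% critical value for 4 degrees of freedom (about 2.78), which is in line with the non-significant p-values reported above.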

Discussion
We have presented an ensemble learning method, ELPGV, to predict genetic values. The key feature of ELPGV is that it assembles the predictions of other basic methods into more accurate predictions. Extensive datasets from human, cattle and wheat have been employed to validate the performance of ELPGV, and all results consistently showed that ELPGV integrates the merits of each method to produce significantly higher predictive ability than any basic method. Based on these advantages, ELPGV is expected to be widely used for prediction in large datasets. Ensemble learning has been widely utilized in genomic selection. For example, Ma et al. [36] assembled two basic methods and trained the weights with the PSO algorithm; however, this approach has several disadvantages: (1) it assumes that the phenotypes of the testing individuals are known, so it is only applicable to prediction with known phenotypes, which is less meaningful in practice; (2) the performance of traditional PSO depends greatly on its parameters, and it often becomes trapped in local optima [37,38], which is consistent with the study of Cai et al. [39]. Liang et al. [40] constructed a stacking ensemble learning framework (SELF) that integrates three machine learning methods, with ordinary least squares regression chosen as the meta-learner, to improve genomic predictions. Their experiments indicated that SELF has great potential to improve genomic predictions in other animal and plant populations. In practice, SELF takes the genomic relationship matrix derived from genotypes directly as input, which might reduce the prediction accuracy of a single basic method. Additionally, Gianola et al. [41] found that bagging can improve the predictive performance of GBLUP and make it more robust against over-fitting. However, because predictive ability increases with training set size [42], bagging may not be feasible for immense datasets.

Fig. 7 The relationship between sample size and maximum difference or minimum difference of the methods

The DE algorithm is another kind of evolutionary algorithm, which has been applied to a series of problems arising in various fields of science, engineering and management [43][44][45]. In our analysis, we found that the DE algorithm is much more stable and always converges to the same solution after repeated runs; furthermore, DE converges fast and is very accurate for high-dimensional problems. It has three main parameters (initial solution size, scaling factor F and crossover probability CR), but it is not sensitive to the parameter settings [46]. Although the DE algorithm has many advantages, its disadvantage is that it is difficult to update model parameters [46], a problem that PSO does not have. Hybridization is therefore an important modification of DE that is implemented to enhance its performance and convergence speed. Plenty of work on the hybridization of DE can be found in the literature. For instance, Pant et al. [47] proposed a hybrid version of DE with PSO, and their results show that the proposed DE-PSO is quite competent at solving the considered test functions as well as real-life problems. Zhang et al. [48] proposed a hybrid technique using DE with PSO for unconstrained optimization problems. Similarly, ELPGV is a hybrid of DE and PSO, which not only inherits the high precision of the DE algorithm but also possesses the fast convergence of the PSO algorithm.
In the prediction of human disease risk, ELPGV exhibits greater advantages over the four basic methods. In almost all situations ELPGV is more accurate than the others, and the gain is much more obvious in comparison with GBLUP, suggesting that GBLUP is not well suited to human datasets, possibly because the relationships between individuals are quite limited and little information is available for GBLUP predictions. In contrast, the situation is quite different for the cattle and wheat datasets. The reason may be that these datasets were collected for selective breeding and the individuals are extensively related, which is consistent with the literature [49]. Additionally, Heslot et al. [50], Azodi et al. [51] and Schrauf et al. [52] compared GBLUP (or equivalent models) with other genomic prediction methods in a variety of plant datasets and showed that the difference between GBLUP and other methods is negligible under large data sizes and polygenic architectures. This is because GBLUP efficiently predicts individual genetic values using relationship information, and all markers are assumed, in a sense, to contribute equally to the construction of the kinship matrix.
The performance of ELPGV is shown to be greatly affected by the similarity of the basic methods, which is consistent with Granitto et al. [53], who concluded that diversity among basic methods is an essential characteristic of a good ensemble method. Therefore, one way to improve the performance of ELPGV is to increase the diversity of the basic methods. For example, BayesB, BayesCπ and BayesR [54] all work well for traits with major-effect QTL and often show similar predictive abilities, so integrating them would not enhance the predictive ability of ELPGV very much; similarly, rrBLUP [55] is theoretically quite similar to GBLUP, as both are based on a polygenic model, so integrating them would not substantially increase the predictive ability.
We have proposed the ELPGV method for optimizing the parameters, which greatly improves the precision of the parameter estimates. It is also versatile enough to allow different and more complex criteria to be maximized. However, there is still room for improvement, for example by combining DE or PSO with other optimization algorithms to form a better hybrid algorithm [46], or by using other ensemble strategies, such as sequential integration methods like boosting [56].

Conclusions
We have presented an ensemble learning method, ELPGV, to predict genetic values. The key feature of ELPGV is that it assembles the predictions of other basic methods into more accurate predictions. ELPGV integrates the merits of each method to produce significantly higher predictive ability than any basic method, and it is simple to implement, using only the predictions of the basic methods as input, without genotype data. Therefore, ELPGV requires very little RAM and can complete its task even on a personal computer; furthermore, ELPGV is computationally fast, taking only several minutes to complete the assembling for tens of thousands of individuals, and it is promising for wide application in genetic prediction.

Parameters       Description                    Value
W_min            minimum weight                 0
W_max            maximum weight                 1
V_min            minimum update velocity        −0.01
V_max            maximum update velocity        0.01
m                the weight size                20
F                scaling factor                 0.5
CR               crossover probability          0.3
ε                inertia weight                 1
c_1              accelerated factor 1           2
c_2              accelerated factor 2           2
Max_iterations   maximum number of iterations   25

Fig. 2 Comparison of the predictive ability of ELPGV and the basic method.a T1D, b BD, c RA, d T2D, e CAD and f HT with WTCCC dataset; different method is denoted with different color, each dot represents single experiment

Fig. 3 Comparison of the predictive ability of ELPGV and the basic method.a CD, b UC and c IBD with IBDGC dataset; different method is denoted with different color, each dot represents single experiment

Fig. 4 Comparison of the predictive ability of ELPGV and the basic methods. a my, b mfp and c scs with cattle dataset; different method is denoted with different color, each dot represents single experiment

Fig. 5 Comparison of the predictive ability of ELPGV and the basic method.a-d GY (grain yield) under four places of CIMMYT wheat dataset; different method is denoted with different color; each dot represents single experiment

Fig. 6 Comparison of the predictive ability of ELPGV and the basic method in simulation. a 5 QTL with heritability 0.2; b 1000 QTL with heritability 0.2; c 5 QTL with heritability 0.5 and d 1000 QTL with heritability 0.5. Different method is denoted with different color, and each dot represents single experiment

Table 3 The predictive ability of four basic methods and ELPGV, and the comparison p-value between ELPGV and others in T1D, T2D, BD, RA, CAD, HT with WTCCC dataset. ELPGV is the ensemble learning based on BayesA, BayesB, BayesCπ and GBLUP. "-" represents that no explicit result was found in this method. Abbreviations: type 1 diabetes (T1D), type 2 diabetes (T2D), bipolar disorder (BD), rheumatoid arthritis (RA), coronary artery disease (CAD) and hypertension (HT)

Table 4 The predictive ability of four basic methods and ELPGV, and the comparison p-value between ELPGV and others in CD, UC, IBD with IBDGC dataset. ELPGV is the ensemble learning based on BayesA, BayesB, BayesCπ and GBLUP. "-" represents that no explicit result was found in this method. Abbreviations: Crohn disease (CD) and ulcerative colitis disease (UC)

Table 7 The averaged predictive ability across 100 replications for different methods in 4 scenes of simulation

Table 8 The maximum and minimum difference of the predictive ability between ELPGV and other methods in different sample size of simulation